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ABSTRACT 


Clustering objects into synthetic groups is a natural activity of any science. Astrophysics is 
not an exception and is now facing a deluge of data. For galaxies, the one-century old Flubble 
classification and the Flubble tuning fork are still largely in use, together with numerous mono- 
or bivariate classifications most often made by eye. Flowever, a classification must be driven by 
the data, and sophisticated multivariate statistical tools are used more and more often. In this 
paper we review these different approaches in order to situate them in the general context of 
unsupervised and supervised learning. We insist on the astrophysical outcomes of these studies 
to show that multivariate analyses provide an obvious path toward a renewal of our classification 
of galaxies and are invaluable tools to investigate the physics and evolution of galaxies. 
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1 INTRODUCTION 


Astrophysics has always adopted specific strategies to classify a relatively modest amount of diversity 
and has much counted on the physics to define the discriminant parameters. This discipline is now facing 
the need for sophisticated statistical tools to tackle the astronomical number of observed and catalogued 
objects and the increasing number of observed properties that describe them. 


The debate about th e usefulness of the morphological classification of galaxies is a rather old one and 
is still alive. Sandage (|Sandage[|2005|), a proponent of a (morphological) classification driven by the data, 
noticed that the Hubble cfassihcation and the Hubble tuning fork have not yet been rep laced by anything 
else despite the efforts of the proponents of a classification driven by the physics (e.g. IConselicej |2006j). 
It also has been recognised to have many flaws: it is a qualitative, subjective and visual approach, difficult 
to use for distant galaxies, it is based solely on the visible morphological parameter while galaxies are 
complex and evolving systems and while we have at our disposal morphologies from X-rays to radio 
wavelengths, spectra, chemical compositions, stellar populations, central black hole masses, kinematics 
of stars and gas... 
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However this debate may not address the right question sinee from a elassifieation point of view, a clas- 


sifieation must be driven by the data, and thus be multivariate (e.g. Fayyad et al., 1996). Consequently, 


adapted tools must be used whieh are not well known to astronomers in general. Nevertheless, numer¬ 
ous studies have been published during the last thirty years or so, especially since the beginning of the 
XXIst century. In this paper we would like to present these different approaches in the general context of 
unsupervised (clustering) and supervised (classification) learning. 

Clustering approaches gather objects according to their similarities either through the choice of a dis¬ 
tance metric or using some adequate criteria for deciding to which cluster some object belongs. There 
is a huge class of techniq ues that partition the data into a pre-defin ed number of clusters. A well-known 
algorithm is the k-means (MacQueenj 1967[ Ghosh and Liul 20101. 


Another family of clustering techniques uses a hierarchical representation of the pairwise distances 
between objects in terms of a number of parameters (variables), through a bottom-up algorithm that 
constructs a tree by relating the closest objects together before relating these first clustering to closest 
clusters or objects, and so forth until the whole sample is exhausted. The final number of groups is then 
chosen by cutting the tree at a fixed distance level. The branches of the tree, called a dendrogram, may or 
may not represent relationships between the objects. 

Originally, phylogenetic methods are designed to build a graph representing the ev olutionary relation¬ 
ships between species (see reviews in |Felsenstein[ |2003| |Makarenkov et aL[ |2006|). Each node of the 
graph indicates a transmission with modification mechanism that creates two or more species inheriting 
from a common ancestor. More generally, a phylogenetic approach can be viewed as an unsupervised 
clustering approach in which relationships are provided. As a consequence, phylogenetic techniques are 
particularly versatile and powerful methods for building cla ssification trees. They can be understood in 
the framework of the graph theory ([Semple and Steel[|2003|). 


There are two kinds of phylogenetic methods, based either on the pairwise distances (or dissimilarities) 
computed from the parameters describing the objects, or on these parameters themselves. 

The distance-based methods build the tree entirely from the distances, putting forward the global sim- 
ilarities between t he objects. The friends-of-friends algorithm is relatively famous in astrophysics (e.g. 
More et al.[|2011[ and references therein). Also known as the single linkage or Nearest Neighbor algo- 
rithm. It IS mathematically related to the M inimum Spanning Tree technique which looks for the simplest 


graph connecting the objects under study (Gower and Ross[ 1969; Feigelson and Babu| 20121. A more 
sophisticated approach used in phylogenetic studies is the Neighbor-Joining Tree technique (jSaitou and 
Nei[|1987]|Gasciiel and Steel]|2006| ). 


In the parameter-based methods, the parameters are called characters which in astrocladistics corre¬ 
spond to the parameters associated to the physical measurements of some properties of the objects. The 
parameter-based methods evaluate all possible trees that can be constructed with the objects, and select 
the tree(s) corresponding to an optimization criterion. The process is thus based on the distribution of the 
parameter values. 

Parameter-based methods can describe a larger variety of evolutionary scenarios and are thus more gen¬ 
eral that the distance-based methods. But this is at the cost of a larger computation time which quickly 
becomes prohibitive. Mathematically, formal connections bet ween parameter- and distance-based meth - 
ods are developed in the case of continuous parameters (e.g. [Thuillard and Fraix-Bumet[ [2009 [ |2015|), 
explaining why both kinds of methods are successfully used in phylogenetic studies. 

Among the par ameter-based t echniques, cladistics is the most famous one. Invented in the 1950’s by 
William Hennig (|Hennig[|1965|), its principle looks simple: two (or more) objects are related if they share 
a common history, that is they possess properties inherited from a common ancestor. In practice, a cladis- 
tic analysis asks for the objects under study to be described by evolutionary characters (parameters or 
descriptors) for which at least two states are defined: one is said to be ancestral, the other one is said to 
be derived. The derived state corresponds to an innovation in the evolution and is assumed to have been 
acquired by an unidentified ancestor. This is the transmission phase of inheritance making descendants 
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look similar to their parents. The aeeidents in this proeess are ealled modifieations and generate diversity. 
This transmission with modifieation proeess was invoked by Darwin to explain the observed hierarehi- 
eal organisation of the biologieal diversity. Several approaehes have been developed to seareh for the 
best tree representation using Maximum Likelihood, some Bayesian approaehes or Maximum Parsimony. 
In Maximum Parsimony, one searehes for the tree representation of the data with the smallest number 
of evolutionary steps to explain the data. But in essenee, any entity, be it biologieal or not, evolving 
with a transmission with modifieation proeess ean be a priori studied by Maximum Parsimony, provided 
evolutionary states ean be defined for the eharaeters. 


A more general representation of relationships are given by networks even though their interpretation is 
quite eomplex, but they ean be approximated by several trees. 


In this review, we do not intend to present all possible teehniques in both supervised and unsupervised 
learning. Rather, we foeus on the astrophysieal published studies made with the objeetive of diseovering 
struetures in a data set, in other words a new elustering and possibly a new elassifieation of gala xies, 
beyond the tradition al Hubble morphologieal seheme. We refer the reader to the eomplete review by |Ball 
and Brunner| (|2010j) on data mining tools used in astrophysies for further information and referenees m 
partieular on tire separation of sourees or the elassifieation of galaxies into morphologieal types. Our paper 
is mainly devoted t o unsupervised elassifieat ion (elustering) and presents the phylogenetie methods whieh 
are not ineluded in|Ball and Brunner] (|2010|). In addition we insist on the astrophysieal outeome and the 
new insights that sueh studies have brought to our knowledge on galaxy physios and diversity. 


Part of this paper is inspired from De et al. (2013) whieh oompares the applieability of some of the 
elustering teehniques on the basis of Gaussian and non Gaussian astronomioal data sets. Here we do not 
make sueh a eomparison. 


The paper is organized as follows. The first seetion presents a frequent prerequisite to data mining, 
the dimension reduetion (Seot. |^. This approaeh has been heavily used in the extragalaetie literature to 
identify groups in the reduoed oomponent spaoe, the motivation being mainly for automatio elassifieation 
in large data sets. The seoond seetion desoribes the important differenoe between this motivation, ealled 
(supervised) elassifieation, and the elustering (unsupervised elassifieation) whieh is the main topie of this 
paper (Seet.j^. We also diseuss shortly the eoneept of similarity between objeets. 


Partitioning methods divide the sample into distinet groups. This ean be made with hard or soft bounds 


depending on whether the membership is a probabiliL 
Neighbor (Seet.|^, Support Veetor Maehine (Seet.[5D 


PJ 


|2010). The fc-Nearest 
s are of the first kind. 


or not (see e.g. [Andrae efal 
and k-means (Seet. metho^ 

The fuzzy elustmng approaeh (Seet.|7) belongs to'llie soft partitioning techniques and often extends the 
applicability of the previous methods. The Information Bottleneck approach is able to provide both kinds 
of classification (Sect.[^. 


These partitioning methods require the number of classes as an input. Some other techniques try to fit 
some distributions to the data set, the optimization process providing the number of groups best fitting the 
data. These techniques are based on mixture model (Sect.|^ and wavelet (Sect.[^ methods. 


A different category of clustering approaches establishes relationships between the objects and derive 
the groups from the generated graph. The first such category are the hierarchical methods (Sect. 
which build a tree based on the pairwise distances. Different cuts on the tree result in different numbers 
of classes. These cuts can be chosen on the basis of objective arguments but also may vary according 
to the goal of the analysis since the tree provides a synthetic view of the structures within the data set, 
instead of just the group memberships. Another kind of graphs are the networks produce by the Minimum 
Spanning Tree method (Sect. [^. The last kind of relationships are evolutionary relationships. This is 
the domain of the phylogenetic techniques, a very wide subject of bioinformatics. We here present only 
the Maximum Parsimony (cladistics), Neighbor-Joining Tree Estimation and Outer Planar Networks that 
have been applied in the context of galaxies (Sect.[T3j). 


Frontiers 


3 

















Fraix-Bumet et al. 


Multivariate classification of galaxies 


2 DIMENSION REDUCTION APPROACHES 

2.1 METHODS 


When the data set is large (both in terms of number of variables and number of observations) one may 
first apply some appropriate dimension reduetion teehnique and then perform elustering on the redueed 
data set. 


One must keep in mind that the diseriminant usefulness of distanees is lost in high dimension parameter 
spaees sinee distanees tend to beeome similar (one of the aspeets of the “eurse of dimensionality”). 

Principal component analysis (PCA) 

In this technique, given a data set of observations on correlated variables, an orthogonal transformation 
is performed to convert it into a set of uncorrelated variables called the principal components. The number 
of principal components is less than or equal to the number of original variables. This transformation is de¬ 
fined in such a way that the first principal component has the largest possible variance. One rule of thumb 
is to consider those components whose eigen values are greater than one in the reduced space. Principal 
components are guaranteed to be independent only if the variables are jointly normally distributed. 

Independent component analysis (ICA) 


Principal component analysis. Factor Analysis, Projection Pursuit are some popular methods based on 
linear transformation. But ICA is different because it looks for the components in the representation that 
are both statistically independent and non Gaussian. ICA separates statistically independent components, 
which are the original source data, from an observed set of data mixtures. All information in the multivari¬ 
ate data sets are not equally important. There is often a need for extraction of the most useful information. 
ICA extracts and reveals useful hidden factors from the whole data sets. ICA defines a generative model 
for the observed multivariate data, which is typically given as a large database of samples. Contrarily to 
PCA, the components are not imposed to be orthogonal. 


Independent Component Analysis (Comon[ 19941, model assumes the form 


X = AS 


( 1 ) 


where X is a data matrix, A is the non-singular mixing matrix, S is matrix of independent components. 

is the unmixing matrix. The main goal of ICA is to estimate the unmixing matrix A~^ . It is assumed 
that the data variables are linear or non-linear mixtures of some latent variables and the mixing system of 
equation can be written as 

~ (^ilS\ “f Cli2S2 T . 4" ClinSni ^ ~ 1, 2, ..., Ti (2) 


The Si are mutually independent and are the entries of the non-singular matrix A. Here n is the 
number of parameters (variables). For performing ICA, the data set has to be whitened in the sense that 
correlations in the data have to be removed. 


There is no rigorous method to determine the optimum number of ICs. For instance, the number of 
independent compone nts can be taken to be equ al to the number of principal components with eigen 
values greater than 1 (|Albazzaz and WangH2004|). As most of the data sets in Astrophysics are likely to 
be non Gaussian, ICA can be successfully used in many situations ( [Chattopadhyay et aL]|2013a|b| ). 


2.2 APPLICATIONS 


PCA technique was applied in a few papers in the 1970s and 1980s with the goal of finding the main 
parameters explaining the variance among galaxy samples. For instance, |Watanabe et al.| (jl985]) used 
four parameters (diameter, magnitude, mean surface brightness and mean concentration index) and found 
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that two principal components explains 97% of the total varianee in their sample of all morphologieal 
types, in agreement with other studies. While [Watanabe et al.|(|1985|) do not find differenees in the two- 


dimension PC plane between elliptieal and disk galaxies, | Whitmore (|1984|) more explieitly looks for an 
objeetive elassifioation of galaxies: “ The faet that there are so many different elassifieation systems for 
galaxies...demonstrates that we are still searehing for the fundamental properties.”. Using more parameters 
(up to 15) they agreed with the other studies on two eomponents explaining most of the varianee, and 
tentatively identify them as seale and form. They do not devise a new elassifieation seheme, but rather 
identify different eorrelations depending on the position of the galaxies on the 2D diagram. 


Chattopadhyay and Chattopadhyay (20061 also found two eomponents in a PCA analysis of samples 
of spiral galaxies with extended rotation eurves. They eonstrueted new “fundamental planes” with these 
eomponents, pinpointing the most important physieal faetors. They also performed a multiple stepwise 
regression analysis of the variation of the overall shape of the rotation eurves and find that it is mainly 
determined by the central surfaee brightness, while the shape purely in the outer part of the galaxy (beyond 
the optieal radius) is mainly determined by the size of the galaetie disk. Sueh a regression is interesting to 
prediet still unobserved values for some parameters, and is improved by the reduetion of the dispersion in 
the prineipal eomponent spaee. 


Peth et al.| (|2015|) used PCA as a simple way to reduee the dimensionality, break internal degeneraeies 
and hnd the natural distributions of data in the parameter space eharaeterizing the struetures and shapes 
of galaxies that they study. These prineipal components are then used to elassify the shape of galaxies 
through a hierarehieal elustering teehnique (see Seet.[TT|). 

Several studies (e.g. |Connolly et al.[ |1995|) used PCA both as a dimension reduetion and as a tool 
for elassifieation of speetra of gafaxies. Spectra are characterized by a high number of attributes (the 
wavelengths) that are not independent since a spectrum is made of a continuum spectrum from stars plus 
absorption and emission lines from the gas. PCA has in principle the po wer to identify the min imum 


number of spectra to combine in order to obtain the observed diversity. (Connolly et al. 1995|) used 


a variant of the PCA technique, the Karhunen-Love transform, which allows for weighting differently 
some parts of the spectra. They not only find that two eigenspectra are necessary to account for most 
of the variance of the spectra of galaxies, but the distribution of classes in the two-parameter space is 
one-dimensional. They proposed a scheme of ten classes, some corresponding to the broad morphological 
types Sa, Sb, SO and E, while the six others are starburst objects. Their work was intended to be used by 
spectral surveys to classify automatically the observations. 


In a similar scope of general classification of galaxies, one must mention the attempt by |Scarlata et al. 
(|2007|) to build a m orphological automated classification of ga laxies, the ZEST cata log, using PCA (see 
(Joppa et ahj 120111) but the parameters used are criticized by lAndrae et al.l (|2010|). This illustrates the 
importance of the selection of the parameters for a multivariate clustering or classification analysis which 
at some point may appear arbitrary and subjective. A special care should be brought to this initial step 
through the analysis of the data set itself with dedicated data mining tools. 


Another instance is the classification established by |Conselice| (flOOb]) using a PCA analysis together 
with a Spearman Rank correlation test to better understand the parameters of the data set. His approach is 
to use the PCA on some set of parameters and then understand the physics of the principal components. 
So the PCA shed light on the underlying physics from which a classification scheme can be built. He 
finds three dimensions for this scheme, with the mass (scale), the star formation (spectral type) and the 
interactions/mergers (degree of dynamical disturbances). This should remind that PCA is not a clustering 
technique per se, it provides a new re presentation of the data from which a clustering may be performed. 
Indeed the work by Conselice|(2006) proposes new relationships between the morphological classes. His 


scheme appears as a more physical replacement of the 2D Hubble tuning fork diagram. 

The Princi pal Component Analysi s assum es a linear combination of the parameters, a rather strong 
assumption. Taghizadeh-Popp et al.| (120121) have used a non-linear PCA, the Principal Curve analysis. 


‘which can be seen as a nonparametric extension of linear PCA. The principal curve is the curve follow¬ 
ing the location of the local mean in the multi-dimensional cloud of data points.” They obtain a drastic 
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dimension reduction with a one-dimension parameter space (the Principal Curve) which they divide arbi¬ 
trarily into 20 groups of equal densities. They compute a distance (the arc length) that ranks the galaxies 
so providing a natural and objective way of ordering, p artitioning and classifying the rich zoo of galaxies 


in the nearby universe. Taghizadeh-Popp et al. (|2012) do not include luminosity nor mass in the process 


in order not to bias the study of the luminosity function as a function of the arc length. This is debatable 
but they are right in saying that it would induce a bias since these parameters will define a strong axis 
of variance in the PCA. Nevertheless would it be possible to classify galaxies without their mass? Could 
massive galaxies have the same history as less massive ones? This shows that the choice of the parameters 
is never so obvious, and generally related to the choice of the technique used as well. The interesting point 
is that they recover known trends in the physics of galaxies, but more importantly they can identify new 
kinds of galaxies pointing out particular physical processes and histories of galaxies. These discoveries 
can only be made by multivariate analyses. 


Folkes et al. (1996) applied PCA on spectra of low signal-to-noise ratio mainly as a dimensionality 
reduction technique. The few principal components are then used to train a neural network in order to 
classify galaxies into the five broad morphological types. Even though this approach is efficient for big 
data sets, it appears limited to normal galaxies since they find that a new classification scheme must be 
used where unusual features are present in the spectra. 

The ICA analysis is still less common than PCA for the stu dy of galaxies. At least two studies have 
been published, an ensemble learning for ICA (|Lu et al.[ [2006]) and a mean field independent component 
analysis ([Allen et al.]|2013|). In the first case, V52b synthetic spectra have been used coming from Single 
Stellar Population models. They select 74 ’’sufficiently” different spectra from these (using an objective 
criterion) since the ensemble learning part converges very slowly. The ICA analysis yields six most signif¬ 
icant components, and the 1326 spectra are fitted on these components. Each component represent a basic 
element behind the spectra of galaxies, and they find that each of them can be associated closely to one or 
a few stellar types plus some peculiar line properties. These six components are then used on real galaxy 
spectra to derive the stellar contents like starlight reddening, stellar velocity dispersion, stellar content, 
and star formation history. Even though PCA is much faster, it does not provide this important information 
because of the orthogonality constraint that does not allow the components to be non-negative. 


Allen et al.[ ([2013]) used the mean field ICA which is a probabilistic ICA using a prior to constrain the 
components. I'hey find that ten components (divided into five continuum and five emission components) 
are required to produce accurate reconstructions of essentially all narrow emission-line galaxies to a very 
high degree of accuracy. Using these ten components on a large sample of Sloan Digital Sky Survey 
(SDSS) galaxies, they identify the regions of parameter space that correspond to pure star formation and 
pure active galactic nucleus (AGN) emission-line spectra, and produce high S/N reconstructions of these 
spectra. 


In a similar fashion, [Hurley et al.[(]2014]) applied the Non-negative Matrix Eactorization technique which 
has been developed for blind source separation problems. Unlike PCA, this technique imposes the condi¬ 
tion that weights and spectral comp onents are non-n egative that is also possible in the ensemble learning 
approach for ICA described above (|Eu et al.[ [2006]). This more closely resembles the physical process 
of emission in the mid-infrared region studied in this work, resulting in physically intuitive components. 
They find seven such components, including two for active galactic nucleus emission, one for star forma¬ 
tion, and one for the rising continuums at longer wavelengths. They show that the seven components can 
be used to separate out different types of objects (see Sect. 0 and to separate out the emission from AGN 
and star formation regions and define a new star formation AGN diagnostic which is consistent with all 
mid-infrared diagnostics already in use but has the advantage that it can be applied to MIR spectra with 
low signal-to-noise ratio or with limited spectral range. 
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3 SUPERVISED AND UNSUPERVISED LEARNING 

3.1 DISTANCES/DISSIMILARITIES 

A lot of learning techniques require a dissimilarity measure. Among them, the distances obey the well- 
known triangular properties and define a metric. In hierarchical clustering, the distances mainly come 
from a very general distance known as the Minkowski’s distance or the pth norm, which may be defined 

as follows. For two points P = {xi,X 2 , . ,Xn) and Q = (yi, 1 / 2 , . ,yn) in the n dimensional space, the 

pth norm is given by 


/ n \ Vp 

(3) 

For p = 1, it gives the Manhattan distance (Li norm). For p = 2, it reduces to the Euclidean distance (L 2 
norm). Also for p = 00 , the norm results in Chebyshev distance. In hierarchical clustering. Euclidean 
and Manhattan distances are mainly used. But these measures are applicable only to continuous data. For 
categorical or binary data other distances must be used but will not be addressed in this paper. 

It may be noted that the selection of the appropriate distance matrix for clustering problems completely 
depends on the physical situation. 


3.2 SUPERVISED LEARNING (CLASSIFICATION) 

Supervised learning technique may be viewed as a mapping between a set of input variables and an output 
variable. This mapping is applied to predict the outputs for unseen data. The main characteristic of super¬ 
vised learning is the availability of annotated training data. It supervises the learning system to instruct 
on the labels to associate with training examples. These labels are known as class labels in classification 
problems. Supervised learning induces models for the training data and these models are then used to 
classify other unlabeled data. Two most popular smervised learning techniques are the Nearest Neighbor 
(Sect.Q and the Support Vector Machines (Sect.[^ classifiers. 


3.3 UNSUPERVISED LEARNING (CLUSTERING) 


The unsupervised learning or clustering seeks some pattern in the data set by starting from the raw data 
with or without any distributional assumption regarding the underlying distribution. The three main cat¬ 
egories of this kind are (i) connectivity based clustering (like hierarchical clustering, see Sect. [TT] ), (ii) 
centroid based clustering (like k-means, see Sect. and (iii) density based clustering (like DBSCAN or 
more generally kernel density estimation). 


An overview of these approaches can be found in D’Abrusco et al.|(20121 with many references of appli¬ 
cations in astrophysics. Most of the methods that \ve present in the following are unsupervised clustering. 
The reason is that the multivariate analyses of galaxies essentially are either supervised approaches based 
mainly on dimension reduction techniques (mostly PCA, see Sect.|^ or unsupervised methods to discover 
new classification schemes of galaxies which are really objective and multivariate. 


4 NEAREST NEIGHBOR 

4.1 METHOD 

The fc-Nearest Neighbor (NN) algorithm is very intuitive. It starts from a training set for which we have 
the class labels. In order to make a prediction about a new observation, one looks at the labels of its k 
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Figure 1. In this k-Nearest Neighbor illustration with k = 5, the central black square more probably 
belongs to the blue class. 


nearest neighbors and uses a majority vote to make the prediction (Fig. [^. As the number of neighbors 
used in making the prediction increases, the decision boundaries become smoother, the bias increases, but 
the variance decreases. 


4.2 APPLICATIONS 


Ball et al. (2007) explored the k-Nearest Neighbor technique for determining photometric redshifts in 
petascale databases using 55 746 quasar spectra from the SDSS. The algorithm is trained on a repre¬ 
sentative sample of the data. The main result is that the comparison between the photometric and the 
spectroscopic redshifts shows no region of catastrophic failure where the two derived values differ a lot, 
contrarily to other methods used to derive photometric redshifts. 


5 SUPPORT VECTOR MACHINE 
5.1 METHOD 

Support Vector Machine (SVM) aims to find the hyperplane that best separates two classes of data through 
an optimization method. Instead of using just a standard orthogonal basis, SVM uses many functions to 
describe good separating surfaces. The input data are viewed as sets of vectors, and the data points closest 
to the classification boundary, determined from a training sample, are the support vectors. SVM funda¬ 
mentally separates two classes of objects which is probably a limitation in its use for the classification of 
galaxies. 

They use optimization methods to find surfaces that best separate categories. Their key innovation is to 
express the separating surfaces in terms of a vastly expanded set of basis functions. Instead of using just a 
standard orthogonal basis, SVMs use many basis elements. 


5.2 APPLICATIONS 


SVM has been used by Huertas-Company et al. (2008) for the morphological classification of galaxies 
from the COSMOS survey. The training sample is a limited sample classified visually using a 12- 
dimensional volume, including 5 morphological parameters, and other characteristics of galaxies such 
as luminosity and redshift. The objective is to be able to classify automatically the results of big sur¬ 
veys. However, the result seems a little bit disappointing since it can only separate between the two broad 
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classes of early- and late-type galaxies, with an error of about 20%, even though this is better than other 
methods generally used. 


6 K-MEANS 

6.1 METHOD 


The k-means algorithm (|MaoQueen], |1967[ jGhosh and Liu] |2010|) is a partitioning approaeh that starts 
with k eentroids, k eorrespondmg to me nurnber of elusters given a priori. It then assigns eaeh data point 
to the elosest eentroid and when the elusters are built, the new k eentroids are eomputed and the proeess 
iterates until eonvergenee (Fig.[^. The result depends very mueh on the initial eentroids. Repeating the 
analysis with several initial ehoiees is always a good idea, but eonsisteney is not guaranteed if the data 
do not eontain distinguishable and roughly spherieal elusters. Some strategies have been devised to g uess 


the best initial ehoiee for the eentroids (e.g. jSugar; 
many indiees are available in the paekage IsbClust ( jCharrad et al.[ 2014p of R ( |ream[|2014p . 


and Jamesl|2003 ; T^junisha and Saravanani [20T0| ) and 


A variant ealled the k-medoids algorithm (Kaufman and Rousseeuw, 1987; Reynolds et al. 20061 


ehooses data points as eenters (medoids) and is known to be more robust to noise and outliers. 



Figure 2. A typie al result of a k-means analysis in whieh the elusters (four here) are elearly 
distinguishable (from[Fraix-Bumet et al.[|2010|). 


6.2 APPLICATIONS 


The k -means algorit hm has been used in the eontext of stars (e.g. Gratton et al. 2011 Simpson et al. 
2012]), galaxies (e.g.jFraix-Bumet et al.]|2010[|Sanehez Almeida et al.l|2010t|Fraix-Bumet et al.||2012|) or 
Gamma-Ray Bursts ( jChattopadhyay et aH|2007p 7 


Sanehez Almeida et al. (2010) performed a k-means analysis of a large number (788 677) of speetra 
from the SDSS. Eaeh speetra is a eolleetion of about 4 000 wavelengths, making the full data set very 
eomputationally demanding for a direet k-means. They thus deeided to limit the speetra to a priori infor¬ 
mative regions, redueing the number of “parameters” to 1637. Their analysis is affeeted by the dependenee 
of the result on the seed. They say that estimation tools for the number of elusters eould not be applied 
beeause of the sample size. Using some eriteria, they end up ehoosing randomly one elassifieation having 
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28 classes. The result looks more like a eontinuum distribution of speetra, and even if not shown, over¬ 
lapping between elasses is important. This questions the validity of the k -means approaeh in this ease as 
another k-means analysis of the same sample has shown (|De et al.l|2014[). 


Multivariate k-means analyses of smaller sample of galaxies with the aim of dise overing new elasses 
of gal axies have been performed as a eomplement to other eluster ing methods by [Fraix-Bumet et al. 
(|2010|) with the four parameters of the fundamental plane, and by |Fraix-Burnet et al.| (|2U12|) with six 
parameters seleeted from 23 available. In the latter ease, the seleetion of the parameters is maae through 
different statistieal tools, in order to find a parameter subspaee in whieh a robust elustering of the data is 
present. This leads to the important result that several very different elustering teehniques yield eompatible 
elusterings, giving good eonfidenee to the result. The astrophysieal implieations are numerous sinee a new 
elassifieation is established and the average properties and the eorrelations varies from group to group and 
often differ from those of the global sample. However, the interpretation benefited from the relationships 


between the elasses established by the phylogenetie method used in these works and diseussed in Seet. 13 


Even though the elusters are similar, the absenee of these links in the k-means results is elearly missing. 


Chattopadhyay and Karmakar| (|2013|) performed a k-means analysis of a large sample of dynamieally 
hot stellar systems from globular elusters to giant elliptieals, in quest of the formation theory of ultra 
eompaet dwarf galaxies (UCDs), using three parameters (logarithm of stellar mass, logarithm of effeetive 
ra dius and stellar mass to light ratio). The number of elusters, five, is given by the optimum eriterion 
of|Sugar and James |(|2003|). The elassifieation of UCDs provides some new elues to the long diseussed 
hypothesis that these objects may be formed either as massive globular clusters or have an origin similar 
to nuclei of dwarf galaxies. 


7 FUZZY CLUSTERING 

7.1 METHODS 

In non-fuzzy or hard clustering, data is divided into crisp clusters, where each data point belongs to exactly 
one cluster. In fuzzy clustering, the data points can belong to more than one cluster, and associated with 
each of the points are membership grades which indicate the degree to which the data points belong to 
the different clusters. Many algorithms exist, many of them being extension of hard clustering algorithms. 
One example is the fuzzy C-means which is very similar to the k-means (Sect. but adding a weight 
between 0 and 1 to each point characterizing its probability to belong to a given group, and a degree of 
fuzziness of the groups. 


7.2 APPLICATIONS 


Coppa et al. (2011) studied the bimodality of galaxies which comes from a double peak distribution in 
some scatter plots, particularly in color-color diagrams. The origin of this bimodality and the relationship 
between the two broad classes, “red” and “blue” or “late type” and “early type”, is still not understood. 
Evolution is probably involved, but then what is the status of the overlapping regions called “the green 
valley”? To know whether this bimodal distribution is an intrinsic property of galaxies and their ev olu- 
tion, multivariate analysis must be used since it appears in several scatter plots. |Coppa et al.|(|2011|) use 
an unsupervised fuzzy partition clustering algorithm applied on the principal cornponents of a ECA anal- 
ysis. They use eight parameters, two coming from spectra, one from photometry and five describing the 
morphology. They keep three principal components to perform the clustering analysis which proceeds 
in two steps: a modified fuzzy k-means algorithm to guess the memberships and the cluster centroids, 
and a second algorithm (fu zzy modification of maximum likelihood estimation) to achieve optimal fuzzy 
partition (see references in|Coppa et al.[|20lT]). 


They decide to identify three clusters, blue, red and green, somewhat giving up the fuzzy nature of their 
study. In addition, they name the clusters after previous classifications, even though “the ’early type cluster 
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is not intended to be made up of pure passive galaxies; rather, it is eomposed also by bulge-dominated 
weakly-star forming objeets.” This is a quite eonfusing praetiee espeeially beeause they diseover some 
new kinds of objeets whieh are invaluable for our understanding of the physies and evolution of galaxies. 


Bayesian approaehes ean also be seen as soft olassifieation as illustrat ed for instanee in th e separa¬ 
tion between star forming galaxies and Aetive Galaetie Nuelei (AGNs) in|Norman et al.|(|2004|) to avoid 
eonfusion between different kinds of objeets. 


8 INFORMATION BOTTLENECK TECHNIQUE 

8.1 METHOD 


The Information Bottleneek Method (Tishby et al.[ 2000) is a simple optimization prineiple for a model- 
free extraetion of the relevant part of one random vanable with respeet to another. The algorithm is 
extremely general and may be applied to different problems in analogous ways. A great advantage of 
this unsupervised elustering teehnique is that it avoids the arbitrary ehoiee of the distanee and provides a 
natural quality measure for the resulting elassifieation. 


Using the mathematieal notations of Slonim et al. (2001) that applied this teehnique to galaxies, the 
optimal olassifieation is given by maxirnizing the functional: 


C\p(c\g)]=HC;A)-r'HC;G) (4) 

where C represents the classes, G the galaxy sample and A the spectral wavelengths. I{C;A) and I(C;G) 
are the mutual information between G;A and G; G. I3~^ is the Lagrange multiplier attached to the com¬ 
plexity constraint. For /3 —0 the classification is non-informative, and for —)■ cx) the representation 
becomes arbitrarily detailed. 


8.2 APPLICATIONS 


Slonim et al. (2001) explain that by normalizing the total photon counts in each spectrum to unity, we can 
consider it as a conditional probability, the probability of observing a photon at a specific wavelength from 
a given galaxy. The ensemble of spectra can thus be seen as a conditional probability distribution function 
that allows to undertake the information theory-based analysis. For any desired number of classes, galaxies 
are classified such that the information content about the spectra is maximally preserved. 

The number of classes is an issu e in most unsupervi sed clustering techniques, and the information 
bottleneck shares this difficulty too. |Slonim et al.|(|2001|) note that ’the true or correct number of classes 
may be an ill-defined quantity for real data sets and the number should be determined by the desired 
resolution, or preserved information". However one should be careful to use objective arguments based 
only on statistics, since the physical interpretation should come at the end to tell whether or not the result 
is interesting. 


The main results of this study is the demonstration that an objective and automated technique can yield 
a classification of spectra which is very physical, in the sense that it recovers results obtained more clas¬ 
sically, but is able to discover other classes and correlations between physical parameters. An interesting 
point in their study is that they applied the same techniques to two samples, one observed and one sim¬ 
ulated. The good agreement between the two clusterings shows that the models of galaxy evolution are 
sensible. This is a good approach to test the models by statistically comparing two populations using 
multivariate data sets. 
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9 MIXTURE MODELS 
9.1 METHODS 


Most partitioning methods use a distance to define the clusters. In model-based clustering methods, each 
cluster can be represente d by a parametric distribu tion, the data set being thus considered as a mixture of 
such model distributions (|Qiu and Tamhane[|2007|). The parameters include the mixing proportions or the 
prior probabilities of the clusters since the tnie cluster memberships of the observations are unobserved. 
The optimization relies on the likelihood of the weighted linear combination of the cluster distributions 
through the Expectation-Minimization (EM) algorithm. Clustering is done by applying the maximum 
posterior (Bayes) rule. The process yields a soft classification (probability of membership) and a fit to 
each cluster distribution. 

The mixture model approach also provides expected misclassification probabilities. It requires the num¬ 
ber of cl usters to be known, which ca n be for instance estimated with the tools developed for the k-means 
analysis ( [Chattopadhyay et aL||2009[ Sect.j^. 


9.2 APPLICATIONS 


Davoodi et al.|(j2006l) find four Gaussian distributions best fit the color distribution of 16 698 extragalactic 
infrared sources. They use this result to propose a classification scheme (Ca to Cd) of galaxies that reveal 
a greater variety of galaxy types than usual spectral energy distribution fitting techniques tha t strongly 
depends on the quality of the template model components. Interestingly, [Davoodi et al.|(|2006|) use their 
soft classification to identify outliers (rare galaxies or transient phases) by summing up the four probability 
density functions for each object. 


Hurley et aLl(f20l4l) used the seven components they have found with a dimension reduction approach 
eii 


(Sect. 1^ to define a parameter space in which they apply an unsupervised Gaussian Mixture Model clus¬ 
tering mgorithm in order to provide a classification tool. This clustering approach is a fuzzy approach 
since clusters describe a probability density function indicating how likely a galaxy could be found in any 
one of the clusters. Eight clusters are found which are consistent with previous classifications. Strangely 
enough, these clusters are named according to the classical classification through a majority rule. We may 
ask why use an unsupervised technique if one believes in an existing ”true“ hard classification? 


10 WAVELET ANALYSIS 
10.1 METHOD 

The wavelet transform is a well known signal analysis technique widely used in many research areas. Its 
key property is the ability to provide a multi-resolution approximation of a given input signal through a 
prototype function ^: 


W{s,r) 




(5) 


where s characterizes the scale and r the translation factor. The prototype function, also called the mother 
wavelet, is continuous in both time and frequency and serves as the analysing window. 


With this definition, wavelets appear as a parametric-model decomposition of a da ta set using som e 
basis functions. They could then be used for dimension reduction and/or classification ([Thuillardt 2001[). 


Shapelets are a scaled version of two-dimensional Gauss-Hermite polynomials and form a set of com¬ 
plete basis functions that are orthonormal on the interval [—cx), oo]. Shapelets are thus suited to decompose 
images. Eor galaxies, their use is limited to high signal-to-noise data and rather regular galaxies since they 
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are gaussian-shaped and spherical (Andrae et al. 20101. The composition is an automatic and objective 
representation of galaxy morphologies. 

Other multiresolution methods have been proposed, like for i nstance the hierarchical Ma rkov models 
extended for the multispectral astronomical image segmentation (Collet and Murta^ 20041. 


10.2 APPLICATIONS 

Wavelets can be used to decompose galaxy spectra into several features that can then be used to classify 
the spectra. In this sense they serve as a dimension reduction technique but contrarily to PCA or ICA the 
basic elements (features) can be chosen to be physically meanin gful, representing the three compon ents 
of spectra: the continuum, the emission and absorption lines (e.g.jStarck et al., 1997[ Liu et al. 20051. 


Andrae et^L] (|20T0l) review how an automatic classification of galaxy morphologies could be done us- 
TTneir go 


ing shapelets. Their goal is not to devise a new classification, since it is extremely difficult to parametrise 
arbitrary galaxy morphologies apart from the question that the morphology is only one property of galax¬ 
ies. To address the parametrisation problem, they use shapelets and then define the distance as the angle 
spanned by their (normalised) coefficient vectors of the shapelets: 


d{xi,Xj) = a.Tccos{xi ■ xj) 


( 6 ) 


They then use a soft (fuzzy) clustering algorithm with the similarity matrix given by: 




{d{Xi, Xj)/d 


max 


a 


S 


(7) 


with dmax being the maximum distance between any two objects in the given data sample, and a > 0 and 
s > 1 being free parameters that tune the similarity measure. This probabilistic clustering technique uses 
the graph theory in which the similarity elements Wmn the weights of the edges. 

They also evaluate the impact of hard clustering methods on the estimation of the parameters charac¬ 
terising the classes depending on the level of overlapping. This is an important point to keep in mind in 
all hard (non-fuzzy) approaches to clustering, be it by hand or algorithmic. They even suggest that the 
processes of galaxy evolution and observations tend to invalidate hard clustering approaches. 

They do not go into the details of the astrophysical interpretation, but they clearly demonstrate the 
advantages of such sophisticated approaches for automatic morphological classification of a huge number 
of galaxies. However, as they rightly say, “a lot of work is still needed on the interpretation of the results.” 


11 HIERARCHICAL CLASSIFICATION METHODS 

11.1 METHODS 

The hierarchical classification method builds a hierarchy of clusters. Two main approaches to form the 
hierarchy are agglomerative and divisive. In the agglomerative approach each observation is considered 
as a cluster and pairs of clusters are merged as one moves up the hierarchy (see Fig. |^. The most similar 
objects are grouped first and those initial groups are merged ultimately into single muster according to 
some proximity measure. These proximity measures are based on either similarities or dissimilarities 
(distances). In the divisive analysis approach all observations at first are grouped in one cluster, and splits 
are performed recursively as one moves down the hierarchy. Here an initial single group of objects is 
divided into two subgroups such that the objects in one subgroup are far from the objects in the other. 
These subgroups are further divided into dissimilar groups until there are as many subgroups as objects. 
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Figure 3. An example of a dendrogram. Distanee between two horizontal branehes is going from left to 
right. The two dashed lines illustrates two euts yielding 9 or 3 elusters. 


In order to deeide whieh elusters should be eombined or where a eluster should be split, a distanee 
matrix is required. The distanees used for hierarehieal elustering are mainly Euelidean and Manhattan for 
eontinuous type data. In order to find distanees between elusters different linkages like single linkage, 
eomplete linkage, average linkage ete are used. Note that the nature of the final elusters totally depends 
upon the ehoiees of distanees and linkages. 

It is interesting to note that if the metrie used is the single linkage, then this method is similar to the 
Minimum Spanning Tree teehnique (Seet.fT^. 


11.2 APPLICATIONS 


Peth et al. (2015) applied a Ward hierarehieal agglomerative elustering to elassify galaxies in distinet 
groups using the first three prineip al eomponent eig enveetors. In this Wnd of approaeh the number of 
groups is ehosen after the analysis. |Peth et al.|(|2015^^ seleeted ten groups as a eompromise between too 
many small groups whieh might appear as too speoifie, and too large ones that would smear out the true 
diversity of the objeets. They also try to define boundaries to these groups in the PC-spaee by fitting a 
eonvex hull around the points within eaeh groups in order to elassify future new observed objeets. How¬ 
ever, a Nearest-Neighbor or SVM teehnique eould be used in this purpose without the need to eompute a 
eonvex hull whieh is a rigid boundary. It is important to reeall that a elassifieation is never definitive and 
would probably evolve with the inelusion of new objeets, as it has been for instanee the ease for the SO 
(lentieular) morphologieal elass of galaxies whieh were not present in the original Hubble elassifieation. 
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One of the main results of the studies by Peth et al. (2015|) is a refined and objeetive elassifieation of 
struetures and morphologies of the galaxies in their saniples. The ten groups are analyzed separately to de¬ 
rive their properties and their probable evolutionary status and history. Their seheme separates quenehed 
eompaet galaxies from larger, smooth proto-elliptieal systems, and star-forming dise-dominated elumpy 
galaxies from star-forming bulge-dominated asymmetrie galaxies. It also reveals a higher fraetion of 
bulge-dominated galaxies than visual elassifieation or one based on the Sersie index. 


Deeision trees are a praetieal use of hierarchieal elustering. Sanehez Almeida et al. (12012 1 propose 


a deeision tree to elassify galaxy speetra aeeording to some general features that usually serves as a 
elassifieation of galaxy properties. Th ey use the deeision tree on thei r previously ASK elasses determined 


with the k-means teehnique (Sect. 
their new classes on another classification. 


Sanchez Almeida et al.[|201()| ). Somehow, in this way, they classify 


Suchkov et al.| (|2005|) have applied an oblique decision tree classifier on the homogeneous multicolor 
imaging data base of the SDSS, the classifier being trained on subsets of objects (stars and galaxies) 
whose nature is precisely known via spectroscopy. Each node in the decision tree is a criterion on one 
parameter, defining an hyperplane parallel to one of the axis. In an oblique decision tree, the criterion is 
based on a (linear) combination of parame ters, so the tree is no more parallel to any of the axes in the 
parameter space. In |Suchkov et al.| (|2005j) the classifier is composed of ten oblique decision trees and 
the final decision is made by votes which yield a class probability distribution for a given object. The 
main result of their study is to show the ability of this approach to automatically classify objects from 
the photometry instead of the spectroscopy which is harder to obtain and analyse, and accurately predict 
the redshifts of both normal and active galaxies. This can increase considerably the samples required to 
analyse statistically the evolution and diversity of galaxies, their properties and their correlations. 


12 MINIMUM SPANNING TREE 

12.1 METHODS 


The Minimum Spanning Tree (MST) is mathematically related to the single link age clustering, known to 
astronomers as the f r iends- of-friends algorithm or Nearest Neighbor algorithm (|Gower and Ross 


Feigelson and Babu 


I 


1969 

TJfof 


20121). A spanning tree is an acyclic, connected graph G which is a set ( 
vertices (nodes) and edges (branches) together with a function w : E ^ R that assigns a weight w{e) t 
each edge e in E. The minimum spanning tree (Fig.|^, is the spanning tree T minimizing the function : 


e € r 


( 8 ) 


If the weights w{e) are distinct, then the solution is unique. A number of algorithms hav e been developed 
to solve exactly the Minimum Spanning Tree problem. The first algorithm is attributed to|Boruvka|(|1926|). 
Other popular algorithms are Prim’s, Krukal’s and the Reverse-Delete algorithms that all hnd solutions m 
polynomial time. The above algorithms also work at higher dimensions in which case the Euclidean F2 
or the Manhattan El distances are generally used. 

Minimum spanning trees have found applications in phylogeny, computer vision, and cytology just to 
name some domains. It has been used in astrophysics, and maybe very early since a large number of 
const ellations defined by early civilizations are also shown to correlate well with a Minimum Spanning 
Tree (|Dry et am2009|). 
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Figure 4. An illustration of a Minimum Spanning Tree linking ten nodes. 


12.2 APPLICATIONS 

The Minimum Spanning Tree teehnique has been heavily used to determine the galaxy elusters in order 
to map the spatial distribution of the bary onie matter, a visible signature of the gravitationa l strueture of 


the Universe shaped by the Dark Matter ([Barrow et al.[ 1985, Bhavsar and Splinterl|1996 ). This is not 


an applieation to elustering in the sense of elassiheation, but this is a spatial elustermg. Indeed the MST 
approaeh has been strongly adapted to the partieular eonstraint of eosmologieal observations: the exaet 
position along the line of sight is only approximately given by the redshift. We do not diseuss any further 
this applieation whieh is not in the main seope of this paper. 


We know of only one use of MST for galaxy olassifieation. Aseasibar and Sanehez-Almeida (20111 
applied this teeh nique to understand their ASK e lassiheation of SDSS speetra obtained with the k-means 
method (Seet. [Sdnehez Almeida et al.[ |2010| ). They find that the majority of the speetral elasses are 


distributed along a well-defined braneh going from the earliest to the latest types, with optieally bright 
aetive galaxies forming an independent braneh that interseets the main sequenee exaetly at the transition 
between early and late types. This deseription is already an interpretation of the 23 ASK elasses that 
present a regular distribution of their speetra as already mentioned in Seet. so that the very linear 
strueture of the MST tree is not surprising. However, the approaeh is interesting beeause this is a rather 
simple and objeetive method to obtain relationships between elasses. 


13 PHYLOGENETIC METHODS 

Basieally, all galaxies share a eommon origin whieh is the gathering of baryonie matter as a self gravitating 
objeet. This baryonie matter was very primitive and has subsequently being enriehed and diversified by 
several generations of stars and many transforming proeesses like galaxy interaetions and mergers. There 
are thus obvious evolutionary relationships between different kinds of galaxies as immediately understood 
by Hubble when he diseovered galaxies and established his famous tuning fork diagram. Taking into 
aeeount the galaxy diversity of morphologies known at that time, he built a phylogenetie tree in whieh 
the relationships are due to the evolution of the stellar orbits whieh, he thought, should flatten with time 
beeause of the dynamieal frietion. Even though we now know that this proeess eannot be aeeomplished 
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in a time shorter than the known age of our Universe, this tuning fork diagram is still used to represent 
galaxy diversity. 

Somewhat strangely enough, phylogenetic analyses of galaxy diversity has not been attempted again for 
a century. This is probably because the data did not allow much progress into this direction. But we now 
have huge multivariate databases and it seems timely to reconsider this question. We here present only a 
few techniques, those that have been already used on astrophysical data sets. 


13.1 METHODS 


Before describing some of the most important methods, let us point out that the development of phyloge¬ 
netic methods has been hindered till the 2000s by very heated discussions on the philosophical merits of 
the different approaches. It is only in recent years that most of the barriers between the different schools 
of thoughts could be overcome by a new generation of researchers. Recently a new picture of phyloge¬ 
netic methods is emerging. It becomes nowadays increasingly clear that all the different approaches can 
be discussed within a common framework including distance- and character-based approaches, and that 
this theoretical framework applies both to phylogenetic trees and networks. 

There are two main categories of methods: the distance-based and the character-based. The “characters” 
are traits, descriptors, observables, parameters or properties, which can be assigned at least two states 
characterizing the evolutionary stage of the objects for that character. For continuous parameters, these 
states can be obtained through discretization. 

Distance-based Approaches: Neighbor Joining Tree Estimation 


For distance-based approaches, Neighbor-Joining is the most popular approach to construct a phylo 


1987, Gascuel and Steel 2006 


e- 

is 


netic tree. The Neighbor Joining Tree Estimation (NJ, Saitou and Nei 
based on a distance (or dissimilarity) matrix. This method is a bottom-up hierarchical clustering methods. 
It starts from a star tree (unresolved tree). A “corrected” distance Q{i, j) between objects i and j from the 
data set of n objects, is computed from the distances d{i, j): 


n n 

Q{h j) = {n- 2)d{i, j) - d{i, k) k) (9) 

k=l k—1 

The branches of the two objects with the lowest Q{i, j) are linked together by a new node u on the tree. 
This node replaces the pair {i,j) in the subsequent iterations through the distance to any other object k: 


d{u, k) 


^ [d{i, k) - d{i, m)] + ^ [d(i, k) - d{j, m)] 


( 10 ) 


Neighbor-Joining minimizes a tree length, a ccording to a criteria that can be viewed as a Balanced 


Minimum Evolution (Gascuel and Steel 
algorithm to reconstruct a tree from the 


2006[ ). For a tree metrics. Neighbor-Joining furnishes a simple 
istance matrix. There is a large literature on how to best ap¬ 


proximate a metrics by a tree metrics (see for instance |Fakcharoenphol et al.[ 12003]). Neighbor-Joining 
is justified if the difference between the original distance matrix and the distance matrix describing the 
X-tree obtained with Neighbor-Joining is not too large. 

Character-based Approaches: Cladistics, Maximum Parsimony, Maximum Likelihood. 


Cladistics has been associated in the 80’s to the search of a maximum parsimony tree. Maximum Parsi¬ 
mony is a powerful approach to find tree-like arrangements of objects (Fig.|^. The drawback is that the 
analysis must consider all possible trees before selecting the most parsimonious one. The computation 
complexity depends on the number of objects and character states, so that too large samples (say more 
than a few thousands) cannot be analyzed. The Maximum Parsimony algorithm can take uncertainties or 
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Figure 5. A example tree obtained with eladisties, represented here as unrooted. When a root is ehosen, 
the tree takes the shape of hierarehieal trees. 


unknowns into aeeount by evaluating the different possibilities allowed by the range of values and seleet- 
ing among them the one that provides the smallest seore. In the ease of unknown parameters, the most 
parsimonious diversifieation seenario provides a predietion for the unknown values. 

In reeent years the definition of eladisties has been extended to the elassifieation of taxa (individuals or 
speeies) defined by eharaeters on a rooted tree. In biologieal applieations, a phylogenetie tree deseribes 
the possible evolution of a taxon eorresponding to the root. The root may either be a real taxon or be 
inferred from the deseendant taxa. The sueeess of a eladisties analysis mueh depends on the behavior 
of the parameters. In partieular, it is sensitive to redundaneies, ineompatibilities, too mueh variability 
(reversals), and parallel and eonvergent evolutions. It is thus a very good tool for investigating whether a 
given set of parameters ean lead to a robust and pertinent diversifieation seenario. 


If a set of eharaeters exaetly defines a phylogeny, then the phylogeny is ealled perfeet. In praetieal 
applieations, the available eharaeters seldom define a perfeet phylogeny. A supplementary measure of the 
deviation to a perfeet phylogeny is neeessary to determine how well a eandidate tree fits the eharaeters. 
In the standard approaeh to parsimony, the seore Sp of a tree eorresponds, after labeling of the internal 
nodes, to the minimum number of edges (m, v) with c{u) ^ c{v), c{u) being the ehyaeter state at node u. 
The tree with the minimum seore is searehed for with some heuristies ( |Felsenstem||1984| ). The maximum 
parsimony approaeh ean be direetly extended to eontinuous eharaeters or values. To eaeh internal node is 
assoeiated a real value f{u). The seore s of a tree equals the sum over all edges of the absolute differenee 
between those values: 


s= \f{u)-f{y)\ 

e={u,v)eE 


( 11 ) 


Robinson ( 


19731 has shown that for a tree defined by eontinuous eharaeters, a maximum parsimony 


seore is reaehed for values of the internal nodes belonging to the set of values (or states) defined on the 
leaves. 
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The main method to seareh for the best tree representation of data beyond Maximum Parsimony inelude 
Maximum Likelihood. We note this teehnique whieh has never been applied to astrophysies in the eontext 
of elassifieation but may be a pertinent approaeh. The problem here is that an evolutionary model must be 
used, and naturally the result will depend signifieantly on it. Maximum Likelihood is used standardly in 
biology, and it may be possible that astrophysieists eould also have well eonstrained physieal models of 



based approaehes are fast and ean be used for data exploration and for the seleetion of the most appropriate 
variables. 


Cladisties when applied to domain outside of biology, like in astroeladisties, refers more generally to the 
elassifieation of objeets by a rooted or an unrooted tree (Fig.[^. In that ease, the tree represents possible 
relationships between taxa. The seareh of the best tree deseribed by a set of eharaeters on a set of objeets 
(or taxa in phylogeny) ean be done by several different approaehes. The most popular methods are the one 
using Maximum Pars i mony or Maximum Likelihood. For eontinuous parameters, the software program 


TNT (Goloboff et al.l 2008) is also quite popular to reeonstruet trees from eharaeters. As an alternative. 


the data ean be diseretized through appropriate binning. 


As mentioned earlier, a new pieture of phylogenies is emerging after the understanding that phylogenies 
on multistate eharaeters reduee through a eoneeptually simple grouping of the eharaeters into a phylogeny 
on binary eharaeters. For binary eharaeters, both distanee- and eharaeter-based approaehes are equivalent. 
This approaeh opens new perspeetives as it furnishes also a bridge between eharaeter-based phylogenies 
and split networks or more preeisely outer planar networks. 


Outer planar networks 


Outer planar networks permit t he simultaneous represen tation of alternative trees with retieulations, and 


are thus generalizations of trees (|Huson and Bryantf|2006|). In order to understand the eonneetion between 
outer planar networks and phylogenetie trees, one has to explain sueeinetly what is ealled a split on a 
eireular order of the taxa. A eireular order on a phylogenetie tree eorresponds to an indexing of the n 
end nodes aeeording to a eireular (eloekwise or anti-eloekwise) seanning of the end nodes. A split on a 
eireular order of the taxa is a partition of the objeets into two disjoint sets (Fig.[^. 



c (1,1) 


( 1 , 1 ) 


Split 


Figure 6. A eireular order for objeets A to G, with their pairs of binary states, arranged aeeording to the 
eireular eonseeutive-ones eondition. The two lines show two different split, one between (0,*) and (1,*), 
the other one between (*,0) and (*,1). 
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For multistate characters, a split can be defined after transformation of each multistate character into a 
binary character. For each pair of states (A,B), a subset of states containing the state A is attributed the 1 
state and the complementary subset including the subset B is given the binary state 0. If the tr ansformation 
can be done on each states and characters (for details see Thuillard and Fraix-Bumetj |2015|) so that each 
binary character fulfills the circular consecutive-ones condition, then the data can be described exactly 
by an outer planar network. By definition the circular consecutive-ones condition are fulfilled if for any 
binary state, the taxa with the 1 state are consecutive on the circular order (Fig.|^. 


Splits in an outer planar network (Fig. furnish neighboring relationships between objects. Objects 
sharing a common property, as defined by splits, are consecutive in a circular order. Outer planar net¬ 
works can be regarded as a generalization of phylogenetic trees. An outer planar network reduces to a 
phylogenetic tree if for each pair of binary characters, the so-called 4-gamete rule is fulfilled. The 4- 
gamete rule states that for each pair of binary characters there is at least one of the 4 possible gametes 
(either (1.0), (0,1), (1,1) or (0,0)) that is missing. 



Figure 7. An example of an outer planar network showing the eight splits of the eight parameters si ... 
s8. 


For distance-based approach, the circular consecutive-ones conditions have to be replaced by the 
fulfillment of the Kalmanson inequalities. For taxa indexed according to a circular order, the dis¬ 
tances between a reference node n and the path i — j are gathered in the distance matrix | | with 

^ + ^j,n — dij), di^n being the pairwise distance between leave i and node n. This distance 

matrix fulfils the so-called Kalmanson inequalities: 


yn 

hj 


>Y^ 




( 12 ) 
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Bandelt and Dress 


(1992) have shown that if a distanee matrix 


fulfils Kalmanson inequalities, 
then the distance matrix ean be exaetly represented by a split network or by an X-tree. The program 


YP' 

hj 


SplitTrees4 (Huson and Bryantj |2006|) permits to construct outer planar networks from a distance matrix. 


In practice, the perfect order is not known or not feasible. The difference between the perfect order and 
the order one obtains w ith a given data set is called the contradiction. The minimum contradiction analysis 
(|Thuillard| |2007[ |2008|) finds the best order one can get. It is a powerful tool for ascertaining whether the 
parameters can lead to a tree-like arrangement of the objects (|Thuillard and Fraix-Burnetl, 2009|). Using 
the parameters that fulfil this property, the method then performs an optimisation of the order and provides 
groupings with an assessment of their robustness. 

We believe that outer planar networks will gain importance in applications outside of biology as they 
furnish a real alternative to the standard classification methods. 


13.2 APPLICATIONS 


Farrah et al. (2009) have used a bayesian framework to compare and group 102 ultra-luminous infrared 
galaxy spectra and yield a network diagram which is used to define three groups. An evolutionary descrip¬ 
tion of these galaxies is proposed from the properties of these groups. Even though their method is not 
a phylogenetic technique per se since the relationships are constructed after the clustering analysis, this 
work illustrates the potential need of phylogenetic tools in astrophysics. 


ination of astrocladistics (|Fraix-B 

umet et al.l 

2006b|c|al). Applications have been successfully performed 

for galaxies ( 

Fraix-Burnet et al., 


2012D, globular clusters (|Fraix-Bumet et al.l 20091 Fraix-Bumet 

and Davoust] 

20151) and Uamma-J 

Kay bursts (Cardone and Fraix-Bumet[|2013). 


The phylogenetic approaches used o n galaxy samples are clearly orie nted towards a multivariate and 
evolutionary classification of galaxies (|ITaix-Burnet et al.[ |2010[ f2012|). To this end, several statistical 
analyses (PCA, k-means, cladistics and minimum contradiction analysis) are used to select the set of 
parameters that yields a robust classification according to several clustering analyses (k-means, cladistics 
and Minimum Contradiction Analysis). Six parameters were so selected among the 23 available: the 
central velocity dispersion, the disc-to-bulge ratio, the effective surface brightness, the metallicity, and the 
line indices NaD and OUT The agreement of the clustering obtained by different techniques reinforces 
largely the result. The cladistics tree (cladogram) is used for the interpretation since it also provides the 
relationships between the groups. 

These relationships are hypothesized as being evolutionary so that the placement of the groups on some 
diagrams reflects the evolution of the properties and their correlations. For instance, the famous funda¬ 
mental plane is not universal at all, this 3D correlation clearly depends on the diversification level of the 
group: the correlation becomes tighter when the history of a galaxy is more complex. Other well-known 
correlations, like Mg^ vs the velocity dispersion, indeed disappear within the groups but is created by the 
alignment of the groups in the scatter plot. This strongly suggests that these correlations (known as scal- 
ing relations) are sta tistical and caused by a hidden confounding factor, which is possibly the evolution 
(|Fraix-B urnetl 120 11 1) . 


The new classification is rather easily interpreted with all the parameters available and by comparison 
with numerical simulations. The galaxies within a given group share a co mmon history, that is a seque nce 
of transforming events (collapse, interaction, harassment, merger...) that |Fraix-Burnet et al.|(|2012|) are 
able to identify. 


Outer planar or split networks have also been applied on galaxy samples (Thuillard and Fraix-Burnet 
2009]) even though it is for a theoretical illustration of an optimisation approach to fulfil as much as possi¬ 


ble the Kalmanson inequalities (Eq. 12). Nevertheless, a classification is obtained on this limited sample of 


100 galaxies and with only three parameters. The main splitting character is the surface brightness (Brie) 
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that separates the sample in two roughly equal bins. Each branch is then split into two other branches 
defined by the character states, low OIII, high OIII for the low Brie branch and low B-R, high B-R for 
the high Brie branch. The essential point here is that the split value separating “low” and “high” are not 
arbitrary at all, they are optimized according to an optimisation criterion aimed at obtaining the best split 
network or X-tree as possible. Even though the result cannot be given too much generality due to the small 
sample, the astrophysical outcome is informative. Errst, all high Brie galaxies have high OIII, but some 
high OIII galaxies have low Brie. This means that some low surface brightness galaxies in this sample 
have star formation, and some high surface brightness objects show only an OIII absorption feature Sec¬ 
ond, all high B-R galaxies have high Brie and high OIII. This means that in this sample, the red objects 
have a high surface brightness and some star formation. They are thus not simply ageing galaxies, but 
probably form stars with high metallicity. Conversely, all low OIII galaxies of the sample have a low B-R, 
so that blue objects do not necessarily form a lot of stars. 


14 CONCLUSIONS AND PERSPECTIVES 

In the astrophysical literature, we have found that there is a growing interest for automated classification 
of galaxies, which is motivated mainly by the exploding amount of available data. Eor this purpose, more 
or less sophisticated statistical analyses are recognized to be necessary. In this paper, we have reviewed 
the techniques used so far. We do not claim to be exhaustive, but we think we have described quite a broad 
range of statistical tools. 

Supervised learning analyses are mainly used to separate classes, morphological types or physical com¬ 
ponents in spectra of galaxies. The Principal Component Analysis is the most frequently used, due to its 
simplicity and efficiency, even though it is not a classification technique but rather a dimension reduc¬ 
tion tool. Its attractiveness lies in its ability to perform automatic classification on moderately large data 
sets, and maybe more importantly, its ability to extract simple and important information from multivari¬ 
ate data. In this respect it greatly succeeds in separating spectra of galaxies, quasars and stars in large 
surveys. 

The supervised learning approaches require a classification to be established beforehand. In nearly all 
cases, the traditional morphological classification is the reference. It thus appears that the astronomers are 
keen to devise an objective way of classifying galaxies, using modern tools and multivariate data, but the 
classes to retrieve are devised subjectively with a visual inspection of images in the visible, hence a rather 
monovariate source. 

In the unsupervised learning analyses of the literature, the morphological classification also often serves 
as a reference that must be matched. However, many studies find different classes which bring new insights 
to the physics of galaxies and their evolution. These classes are homogeneous in the multidimensional 
parameter space, and not necessarily in the traditional classification scheme. Because of the number of 
properties to consider, the description of these new classes is more complicated, but simpler (and more 
pertinent) when a comparison with models and numerical simulations is performed. In addition, new ki nds 
of objects are found which would not be possible in a multidimensional parameter space with traditional 
approaches. 

So an automatic classification of galaxies is becoming more and more crucial. The question remains of 
which classification is concerned. The predominance of the morphology as the most important parameter 
associated with the traditional classification scheme, is nearly overwhelming. Most unsupervised learning 
analyses yield new classifications, but this is not really exploited as such since their goal is often to propose 
an automatic way to retrieve the morphological classification. 

We think that this goal is hopeless since it hides a fundamental contradiction between the classifica¬ 
tion obtained from a traditional visual subjective and monovariate approach and the one yielded by a 
multivariate automatic and objective technique. The fact that obvious correlations exist between the new 
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classifications and the traditional one is a very strong support in favor of these advaneed approaehes and 
should not obliterate the differenee in the elasses. 

The astrophysieal results deseribed in this review provide other arguments in favor of the statistieal 
teehniques, mainly beeause these tools ean navigate more easily in a large dimensional spaee: 


• multivariate analyses are partieularly interesting in the ease of speetra, both for supervised and unsu¬ 
pervised elassifieation. Dimension reduetion is here an obvious requirement but proper unsupervised 
elustering is also neeessary to diseover new ki nds of objeets. 

• for speetra, unsupervised teehniques generally do not require fitting with model speetra, so that the 
eomparison between models and observations ean really be performed in the multivariate parameter 
spaee. 

• more generally, the eomparison between the observations, models and numerieal simulations ean be 
made by eomparing the populations eoming from the elassifieations of real and simulated galaxies, 
independently or together. 

• soft (fuzzy), tree- or network-based elassifieations seem more appropriate to the eontinuous 
distribution of galaxy parameters than hard elustering. 

• some teehniques are based on the relationships between the objeets and/or the elasses. It is thus 
possible to objeetively understand for instanee the links between dynamieally hot systems, or the 
plaee of the “green valley” galaxies with respeet to “blue” and “red” ones, or the evolution of galaxies 
within the fundamental plane. 


We eonelude from this review that unsupervised analyses should not be afraid to propose new elassifi¬ 
eations of galaxies. These new elassifieations should be eompared to other sueh elassifieations, this is the 
only way to draw a global view of galaxy diversity and be able to elassify automatieally galaxies of the 
present and future big surveys. In addition, and probably more importantly, the physies of galaxies being 
intrinsieally multivariate, their elassifieation eannot be based on only one eriterion. 

It is important to remember that there is not a unique best elassifieation, and not a best tool. Comparison 
of results is a valuable task sinee it brings a lot of information on the nature of the data, the objeets and 
their parameters. Also a elassifioation is never definitive, and neeessarily evolves with our knowledge and 
the diseovery of new objeets. 


We wish to end this review with the eluster validation question. This is an important issue in elustering 
and elassifieation. In general, eross-validation and bootstrap teehniques are rather easy and provide good 
estimates of eluster robustness. Some other validation indiee s are Dunn s Validation Index (|Dunnl|1973| ), 


Davies-Bouldin Validity I ndex ([Davies and Bouldinj |I979| ), C Index (Hubert and Sehultz[ |I976| ) and 


Silhoutte Validation Index (|Roussee^uw[ 1987). Many more are given in the dusterCrit paekage of R. 


In most of the elustering algorithms, the number of elusters are user speeified. This is a diffieult ques¬ 
tion, there are many tools (Seet. to objeetively guess the optimum number but they all have their 
drawbaeks and limitations. Nevertneless, they should be used as mueh as possible to provide some hints 
and justifieations. 
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